Chris Pollett > Old Classses > CS267
( Print View )

Student Corner:
  [Submit Sec1]
  [Grades Sec1]

  [Lecture Notes]
  [Discussion Board]

Course Info:
  [Texts & Links]
  [Description]
  [Course Outcomes]
  [Outcomes Matrix]
  [Course Schedule]
  [Grading]
  [Requirements/HW/Quizzes]
  [Class Protocols]
  [Exam Info]
  [Regrades]
  [University Policies]
  [Announcements]

HW Assignments:
  [Hw1] [Hw2] [Hw3]
  [Hw4] [Hw5] [Quizzes]

Practice Exams:
  [Mid1]  [Mid2]   [Final]

HW#1 --- last modified January 27 2019 04:55:54..

Solution set.

Due date: Sep 10

Files to be submitted:
Hw1.zip

Purpose: To gain some experience with language modeling, to dust off our programming skills, and to start work on coding an inverted index. To gain experience using an open source crawler. To practice evaluating search results.

Related Course Outcomes:

The main course outcomes covered by this assignment are:

LO1 - Code a basic inverted index capable of performing conjunctive queries.

LO6 - Be able to evaluate search results by hand and using TREC eval software.

Specification:

This homework has three parts, some text-based problems, a write-up of some experiments with Yioop, and a coding part. For each part, I will say file names of where I want you to put stuff. After you are done the homework take all the files you have created put them into a folder Hw1, zip it to make a file Hw1.zip, and submit that.

The first file I want you to make is the plain text file problems.txt. In it put your answers to the following questions:

Open Source Shakespeare has stats about Shakespeare's works. For example, there are 884,421 total words in Shakespeare's 43 works. Using this site you can calculate the maximum likelihood estimate for any term in Shakespeare. (a) Compute the MLE for each of the terms: (i) neither, (ii) a (iii) borrower, (i) nor, (v) lender, (vi) be. Then (b) compute using these the likelihood of the phrase: Neither a borrower nor a lender be.
Pick a paragraph out of your favorite Wikipedia page. For each term in this paragraph determine its frequency. Create a log scale plot of frequency versus rank. For this small a sample size, how well is Zipf's Law followed?
Using the transition matrix `M` given in the book for the Markov model in Figure 1.6, compute `(1, 0, 0, 0)M^{5}`, `(1, 0, 0, 0) M^{10}`, ..., `(1, 0, 0, 0)M^{50}` (might help to use a Math tool like Wolfram alpha, Mathematica or write a short program). Do they converge? If so, try to find a math result (by searching around online on say conditions for page rank converging or if you're a genius by hand) which might explain why or why not.

For the second part of the homework, I want you to download and install Yioop as described in class. I want you to conduct an experiment to compare the different Yioop's page summarizers available in Yioop. First, pick five Wikipedia pages of your choice. If possible, choose one of the pages to be in a language other than English in which you are fluent. For each page, come up with a list of queries that that page might be able to answer. Make sure your queries only involve words that appear on the given page. Next I want you to create by hand a short summary (say 5 sentences) for each of your Wikipedia pages. Your summaries should be in the same language as those pages. If you are working in a group of more than one person have a different person make the queries from the one who makes the summary. Next, for each of your summaries look at the list of five queries for the page the summary came from. Determine which queries have all their terms still in the summary and make a count of this for each page. For example, you might have wiki page 1 had 3 out 5 queries completely contained in the summary. Log in to the root account of your copy of Yioop, go to the Page Options activety. Here you can select between several summarizers for a web. I want you to conduct your experiments using the BASIC and Centroid Weighted Summarizers. Under Page Options go to the Test Options tab. For each Wikipedia page your choose, copy and paste its HTML source into the Test Options text area and click Test Process Page. Look at the output and determine the number of times Yioop got the Human language of the Wikipedia page correct. Then take the summaries you get from each Wikipedia page you chose and for each of the two summarizers, calculate as you did for your own human summaries the number of times the summary still contained all of the query terms for each of your queries. Write up all of these experiments saying which summarizer performed better, how it compared to a human summarizer, and maybe a paragraph conclusion in the file summaries.txt.

For Part 3, I want you to code a simple inverted index printer. It will be run from the command line with a line like:

aux_program IndexPrinter filename

Here aux_program is at most one of java, php, python (your choice). For php or python, IndexPrinter should be either IndexPrinter.php or IndexPrinter.py respectively If you choose to do this project in C or C++, aux_program is optional. Your program will be compiled on a Mac running MacOS High Sierra, having used brew to install the compiler. filename is the name of some txt document. This document is assumed to contain only ASCII characters with '\n' used for line endings. For this assignment, the sequence '\n\n' indicates a new "document" within this file. A "term" is any sequence of characters that does not include a white-space character. Once run, your code should read in this file into a data structure of your choice in memory. We assume for this homework the file is small enough to be read into memory. Your program should sort all terms seen in any document. For each term in this sorted order, your program should output to stdout, the term on a line by itself. On the next line it should output the number of documents the term appeared in, comma, the number of occurrences of the term across all documents, comma, and then a list a sequence of pairs in the form (doc_num, character position within the document that the term appears at). A snippet of this table might look like:

aadvark
2,4,(5,1),(8,6),(8,21),(8,30)
aaron
1,1,(2,1)
...

In the above (5,1) indicates the term aadvark occurred once in the fifth document in the file and it appeared at character offset 1. Once done outputting this data your program should stop. That is all there is to Part 3.

Point Breakdown

Part 1 (each problem 1pt - 1 fully correct, 0.5 any defect or incomplete, 0 didn't do or was wrong)	3pts
summaries.txt experiments.	2pts
Code is reasonably well-commented and structured	1pt
Reads the documents in the file into a reasonable data structure (1/2pt each)	1pt
Program creates an inverted index in memory (1pt keeps count of num_docs and frequency for each term, 1pt keeps track of information needed for each record associated with an entry).	2pts
Program outputs inverted index as described	1pt
Total	10pts